<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" href="../../aosa.css" type="text/css">
    <title>500 Lines or Less: DBDB: Dog Bed Database</title>
  </head>
  <body>

    <div class="titlebox">
      <h1>500 Lines or Less<br>DBDB: Dog Bed Database</h1>
      <p class="author">Taavi Burns</p>
    </div>

    <p><em>As the newest bass (and sometimes tenor) in <a href="http://www.countermeasuremusic.com">Countermeasure</a>, Taavi strives to break the mould... sometimes just by ignoring its existence. This is certainly true through the diversity of workplaces in his career: IBM (doing C and Perl), FreshBooks (all the things), Points.com (doing Python), and now at PagerDuty (doing Scala). Aside from that—when not gliding along on his Brompton folding bike—you might find him playing Minecraft with his son or engaging in parkour (or rock climbing, or other adventures) with his wife. He knits continental.</em></p>

<h2 id="introduction">Introduction</h2>

<p>DBDB (Dog Bed Database) is a Python library that implements a simple key/value database. It lets you associate a key with a value, and store that association on disk for later retrieval.</p>

<p>DBDB aims to preserve data in the face of computer crashes and error conditions. It also avoids holding all data in RAM at once so you can store more data than you have RAM.</p>

<h2 id="memory">Memory</h2>

<p>I remember the first time I was really stuck on a bug. When I finished typing in my BASIC program and ran it, weird sparkly pixels showed up on the screen, and the program aborted early. When I went back to look at the code, the last few lines of the program were gone.</p>

<p>One of my mom's friends knew how to program, so we set up a call. Within a few minutes of speaking with her, I found the problem: the program was too big, and had encroached onto video memory. Clearing the screen truncated the program, and the sparkles were artifacts of Applesoft BASIC's behaviour of storing program state in RAM just beyond the end of the program.</p>

<p>From that moment onwards, I cared about memory allocation. I learned about pointers and how to allocate memory with malloc. I learned how my data structures were laid out in memory. And I learned to be very, very careful about how I changed them.</p>

<p>Some years later, while reading about a process-oriented language called Erlang, I learned that it didn't actually have to copy data to send messages between processes, because everything was immutable. I then discovered immutable data structures in Clojure, and it really began to sink in.</p>

<p>When I read about CouchDB in 2013, I just smiled and nodded, recognising the structures and mechanisms for managing complex data as it changes.</p>

<p>I learned that you can design systems built around immutable data.</p>

<p>Then I agreed to write a book chapter.</p>

<p>I thought that describing the core data storage concepts of CouchDB (as I understood them) would be fun.</p>

<p>While trying to write a binary tree algorithm that mutated the tree in place, I got frustrated with how complicated things were getting. The number of edge cases and trying to reason about how changes in one part of the tree affected others was making my head hurt. I had no idea how I was going to explain all of this.</p>

<p>Remembering lessons learned, I took a peek at a recursive algorithm for updating immutable binary trees and it turned out to be relatively straightforward.</p>

<p>I learned, once again, that it's easier to reason about things that don't change.</p>

<p>So starts the story.</p>

<h2 id="why-is-it-interesting">Why Is it Interesting?</h2>

<p>Most projects require a database of some kind. You really shouldn't write your own; there are many edge cases that will bite you, even if you're just writing JSON to disk:</p>

<ul>
<li>What happens if your filesystem runs out of space?</li>
<li>What happens if your laptop battery dies while saving?</li>
<li>What if your data size exceeds available memory? (Unlikely for most applications on modern desktop computers… but not unlikely for a mobile device or server-side web application.)</li>
</ul>

<p>However, if you want to <em>understand</em> how a database handles all of these problems, writing one for yourself can be a good idea.</p>

<p>The techniques and concepts we discuss here should be applicable to any problem that needs to have rational, predictable behaviour when faced with failure.</p>

<p>Speaking of failure...</p>

<h2 id="characterizing-failure">Characterizing Failure</h2>

<p>Databases are often characterized by how closely they adhere to the ACID properties: atomicity, consistency, isolation, and durability.</p>

<p>Updates in DBDB are atomic and durable, two attributes which are described later in the chapter. DBDB provides no consistency guarantees as there are no constraints on the data stored. Isolation is likewise not implemented.</p>

<p>Application code can, of course, impose its own consistency guarantees, but proper isolation requires a transaction manager. We won't attempt that here; however, you can learn more about transaction management in the <a href="http://aosabook.org/en/500L/an-archaeology-inspired-database.html">CircleDB chapter</a>.</p>

<p>We also have other system-maintenance problems to think about. Stale data is not reclaimed in this implementation, so repeated updates (even to the same key) will eventually consume all disk space. (You will shortly discover why this is the case.) <a href="http://www.postgresql.org/">PostgreSQL</a> calls this reclamation &quot;vacuuming&quot; (which makes old row space available for re-use), and <a href="http://couchdb.apache.org/">CouchDB</a> calls it &quot;compaction&quot; (by rewriting the &quot;live&quot; parts of the data into a new file, and atomically moving it over the old one).</p>

<p>DBDB could be enhanced to add a compaction feature, but it is left as an exercise for the reader<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>

<h2 id="the-architecture-of-dbdb">The Architecture of DBDB</h2>

<p>DBDB separates the concerns of &quot;put this on disk somewhere&quot; (how data are laid out in a file; the physical layer) from the logical structure of the data (a binary tree in this example; the logical layer) from the contents of the key/value store (the association of key <code>a</code> to value <code>foo</code>; the public API).</p>

<p>Many databases separate the logical and physical aspects as it is is often useful to provide alternative implementations of each to get different performance characteristics, e.g. DB2's SMS (files in a filesystem) versus DMS (raw block device) tablespaces, or MySQL's <a href="http://dev.mysql.com/doc/refman/5.7/en/storage-engines.html">alternative engine implementations</a>.</p>

<h2 id="discovering-the-design">Discovering the Design</h2>

<p>Most of the chapters in this book describe how a program was built from inception to completion. However, that is not how most of us interact with the code we're working on. We most often discover code that was written by others, and figure out how to modify or extend it to do something different.</p>

<p>In this chapter, we'll assume that DBDB is a completed project, and walk through it to learn how it works. Let's explore the structure of the entire project first.</p>

<h3 id="organisational-units">Organisational Units</h3>

<p>Units are ordered here by distance from the end user; that is, the first module is the one that a user of this program would likely need to know the most about, while the last is something they should have very little interaction with.</p>

<ul>
<li><p><code>tool.py</code> defines a command-line tool for exploring a database from a terminal window.</p></li>
<li><p><code>interface.py</code> defines a class (<code>DBDB</code>) which implements the Python dictionary API using the concrete <code>BinaryTree</code> implementation. This is how you'd use DBDB inside a Python program.</p></li>
<li><p><code>logical.py</code> defines the logical layer. It's an abstract interface to a key/value store.</p>
<ul>
<li><p><code>LogicalBase</code> provides the API for logical updates (like get, set, and commit) and defers to a concrete subclass to implement the updates themselves. It also manages storage locking and dereferencing internal nodes.</p></li>
<li><p><code>ValueRef</code> is a Python object that refers to a binary blob stored in the database. The indirection lets us avoid loading the entire data store into memory all at once.</p></li>
</ul></li>
<li><p><code>binary_tree.py</code> defines a concrete binary tree algorithm underneath the logical interface.</p>
<ul>
<li><p><code>BinaryTree</code> provides a concrete implementation of a binary tree, with methods for getting, inserting, and deleting key/value pairs. <code>BinaryTree</code> represents an immutable tree; updates are performed by returning a new tree which shares common structure with the old one.</p></li>
<li><p><code>BinaryNode</code> implements a node in the binary tree.</p></li>
<li><p><code>BinaryNodeRef</code> is a specialised <code>ValueRef</code> which knows how to serialise and deserialise a <code>BinaryNode</code>.</p></li>
</ul></li>
<li><p><code>physical.py</code> defines the physical layer. The <code>Storage</code> class provides persistent, (mostly) append-only record storage.</p></li>
</ul>

<p>These modules grew from attempting to give each class a single responsibility. In other words, each class should have only one reason to change.</p>

<h3 id="reading-a-value">Reading a Value</h3>

<p>We'll start with the simplest case: reading a value from the database. Let's see what happens when we try to get the value associated with key <code>foo</code> in <code>example.db</code>:</p>

<pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">python</span> -m dbdb.tool example.db get foo</code></pre>

<p>This runs the <code>main()</code> function from module <code>dbdb.tool</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/tool.py</span>
<span class="kw">def</span> main(argv):
    <span class="kw">if</span> not (<span class="dv">4</span> &lt;= <span class="dt">len</span>(argv) &lt;= <span class="dv">5</span>):
        usage()
        <span class="kw">return</span> BAD_ARGS
    dbname, verb, key, value = (argv[<span class="dv">1</span>:] + [<span class="ot">None</span>])[:<span class="dv">4</span>]
    <span class="kw">if</span> verb not in {<span class="st">&#39;get&#39;</span>, <span class="st">&#39;set&#39;</span>, <span class="st">&#39;delete&#39;</span>}:
        usage()
        <span class="kw">return</span> BAD_VERB
    db = dbdb.<span class="ot">connect</span>(dbname)          <span class="co"># CONNECT</span>
    <span class="kw">try</span>:
        <span class="kw">if</span> verb == <span class="st">&#39;get&#39;</span>:
            sys.stdout.write(db[key])  <span class="co"># GET VALUE</span>
        <span class="kw">elif</span> verb == <span class="st">&#39;set&#39;</span>:
            db[key] = value
            db.commit()
        <span class="kw">else</span>:
            <span class="kw">del</span> db[key]
            db.commit()
    <span class="kw">except</span> <span class="ot">KeyError</span>:
        <span class="dt">print</span>(<span class="st">&quot;Key not found&quot;</span>, <span class="dt">file</span>=sys.stderr)
        <span class="kw">return</span> BAD_KEY
    <span class="kw">return</span> OK</code></pre>

<p>The <code>connect()</code> function opens the database file (possibly creating it, but never overwriting it) and returns an instance of <code>DBDB</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/__init__.py</span>
<span class="kw">def</span> <span class="ot">connect</span>(dbname):
    <span class="kw">try</span>:
        f = <span class="dt">open</span>(dbname, <span class="st">&#39;r+b&#39;</span>)
    <span class="kw">except</span> <span class="ot">IOError</span>:
        fd = os.<span class="dt">open</span>(dbname, os.O_RDWR | os.O_CREAT)
        f = os.fdopen(fd, <span class="st">&#39;r+b&#39;</span>)
    <span class="kw">return</span> DBDB(f)</code></pre>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/interface.py</span>
<span class="kw">class</span> DBDB(<span class="dt">object</span>):

    <span class="kw">def</span> <span class="ot">__init__</span>(<span class="ot">self</span>, f):
        <span class="ot">self</span>._storage = Storage(f)
        <span class="ot">self</span>._tree = BinaryTree(<span class="ot">self</span>._storage)</code></pre>

<p>We see right away that <code>DBDB</code> has a reference to an instance of <code>Storage</code>, but it also shares that reference with <code>self._tree</code>. Why? Can't <code>self._tree</code> manage access to the storage by itself?</p>

<p>The question of which objects &quot;own&quot; a resource is often an important one in a design, because it gives us hints about what changes might be unsafe. Let's keep that question in mind as we move on.</p>

<p>Once we have a DBDB instance, getting the value at <code>key</code> is done via a dictionary lookup (<code>db[key]</code>), which causes the Python interpreter to call <code>DBDB.__getitem__()</code>.</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/interface.py</span>
<span class="kw">class</span> DBDB(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> <span class="ot">__getitem__</span>(<span class="ot">self</span>, key):
        <span class="ot">self</span>._assert_not_closed()
        <span class="kw">return</span> <span class="ot">self</span>._tree.get(key)

    <span class="kw">def</span> _assert_not_closed(<span class="ot">self</span>):
        <span class="kw">if</span> <span class="ot">self</span>._storage.closed:
            <span class="kw">raise</span> <span class="ot">ValueError</span>(<span class="st">&#39;Database closed.&#39;</span>)</code></pre>

<p><code>__getitem__()</code> ensures that the database is still open by calling <code>_assert_not_closed</code>. Aha! Here we see at least one reason why <code>DBDB</code> needs direct access to our <code>Storage</code> instance: so it can enforce preconditions. (Do you agree with this design? Can you think of a different way that we could do this?)</p>

<p>DBDB then retrieves the value associated with <code>key</code> on the internal <code>_tree</code> by calling <code>_tree.get()</code>, which is provided by <code>LogicalBase</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> LogicalBase(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> get(<span class="ot">self</span>, key):
        <span class="kw">if</span> not <span class="ot">self</span>._storage.locked:
            <span class="ot">self</span>._refresh_tree_ref()
        <span class="kw">return</span> <span class="ot">self</span>._get(<span class="ot">self</span>._follow(<span class="ot">self</span>._tree_ref), key)</code></pre>

<p><code>get()</code> checks if we have the storage locked. We're not 100% sure <em>why</em> there might be a lock here, but we can guess that it probably exists to allow writers to serialize access to the data. What happens if the storage isn't locked?</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> LogicalBase(<span class="dt">object</span>):
<span class="co"># ...</span>
<span class="kw">def</span> _refresh_tree_ref(<span class="ot">self</span>):
        <span class="ot">self</span>._tree_ref = <span class="ot">self</span>.node_ref_class(
            address=<span class="ot">self</span>._storage.get_root_address())</code></pre>

<p><code>_refresh_tree_ref</code> resets the tree's &quot;view&quot; of the data with what is currently on disk, allowing us to perform a completely up-to-date read.</p>

<p>What if storage <em>is</em> locked when we attempt a read? This means that some other process is probably changing the data we want to read right now; our read is not likely to be up-to-date with the current state of the data. This is generally known as a &quot;dirty read&quot;. This pattern allows many readers to access data without ever worrying about blocking, at the expense of being slightly out-of-date.</p>

<p>For now, let's take a look at how we actually retrieve the data:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/binary_tree.py</span>
<span class="kw">class</span> BinaryTree(LogicalBase):
<span class="co"># ...</span>
    <span class="kw">def</span> _get(<span class="ot">self</span>, node, key):
        <span class="kw">while</span> node is not <span class="ot">None</span>:
            <span class="kw">if</span> key &lt; node.key:
                node = <span class="ot">self</span>._follow(node.left_ref)
            <span class="kw">elif</span> node.key &lt; key:
                node = <span class="ot">self</span>._follow(node.right_ref)
            <span class="kw">else</span>:
                <span class="kw">return</span> <span class="ot">self</span>._follow(node.value_ref)
        <span class="kw">raise</span> <span class="ot">KeyError</span></code></pre>

<p>This is a standard binary tree search, following refs to their nodes. We know from reading the <code>BinaryTree</code> documentation that <code>Node</code>s and <code>NodeRef</code>s are value objects: they are immutable and their contents never change. <code>Node</code>s are created with an associated key and value, and left and right children. Those associations also never change. The content of the whole <code>BinaryTree</code> only visibly changes when the root node is replaced. This means that we don't need to worry about the contents of our tree being changed while we are performing the search.</p>

<p>Once the associated value is found, it is written to <code>stdout</code> by <code>main()</code> without adding any extra newlines, to preserve the user's data exactly.</p>

<h4 id="inserting-and-updating">Inserting and Updating</h4>

<p>Now we'll set key <code>foo</code> to value <code>bar</code> in <code>example.db</code>:</p>

<pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">python</span> -m dbdb.tool example.db set foo bar</code></pre>

<p>Again, this runs the <code>main()</code> function from module <code>dbdb.tool</code>. Since we've seen this code before, we'll just highlight the important parts:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/tool.py</span>
<span class="kw">def</span> main(argv):
    ...
    db = dbdb.<span class="ot">connect</span>(dbname)          <span class="co"># CONNECT</span>
    <span class="kw">try</span>:
        ...
        <span class="kw">elif</span> verb == <span class="st">&#39;set&#39;</span>:
            db[key] = value            <span class="co"># SET VALUE</span>
            db.commit()                <span class="co"># COMMIT</span>
        ...
    <span class="kw">except</span> <span class="ot">KeyError</span>:
        ...</code></pre>

<p>This time we set the value with <code>db[key] = value</code> which calls <code>DBDB.__setitem__()</code>.</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/interface.py</span>
<span class="kw">class</span> DBDB(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> <span class="ot">__setitem__</span>(<span class="ot">self</span>, key, value):
        <span class="ot">self</span>._assert_not_closed()
        <span class="kw">return</span> <span class="ot">self</span>._tree.<span class="dt">set</span>(key, value)</code></pre>

<p><code>__setitem__</code> ensures that the database is still open and then stores the association from <code>key</code> to <code>value</code> on the internal <code>_tree</code> by calling <code>_tree.set()</code>.</p>

<p><code>_tree.set()</code> is provided by <code>LogicalBase</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> LogicalBase(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> <span class="dt">set</span>(<span class="ot">self</span>, key, value):
        <span class="kw">if</span> <span class="ot">self</span>._storage.lock():
            <span class="ot">self</span>._refresh_tree_ref()
        <span class="ot">self</span>._tree_ref = <span class="ot">self</span>._insert(
            <span class="ot">self</span>._follow(<span class="ot">self</span>._tree_ref), key, <span class="ot">self</span>.value_ref_class(value))</code></pre>

<p><code>set()</code> first checks the storage lock:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/storage.py</span>
<span class="kw">class</span> Storage(<span class="dt">object</span>):
    ...
    <span class="kw">def</span> lock(<span class="ot">self</span>):
        <span class="kw">if</span> not <span class="ot">self</span>.locked:
            portalocker.lock(<span class="ot">self</span>._f, portalocker.LOCK_EX)
            <span class="ot">self</span>.locked = <span class="ot">True</span>
            <span class="kw">return</span> <span class="ot">True</span>
        <span class="kw">else</span>:
            <span class="kw">return</span> <span class="ot">False</span></code></pre>

<p>There are two important things to note here:</p>

<ul>
<li>Our lock is provided by a 3rd-party file-locking library called <a href="https://pypi.python.org/pypi/portalocker">portalocker</a>.</li>
<li><code>lock()</code> returns <code>False</code> if the database was already locked, and <code>True</code> otherwise.</li>
</ul>

<p>Returning to <code>_tree.set()</code>, we can now understand why it checked the return value of <code>lock()</code> in the first place: it lets us call <code>_refresh_tree_ref</code> for the most recent root node reference so we don't lose updates that another process may have made since we last refreshed the tree from disk. Then it replaces the root tree node with a new tree containing the inserted (or updated) key/value.</p>

<p>Inserting or updating the tree doesn't mutate any nodes, because <code>_insert()</code> returns a new tree. The new tree shares unchanged parts with the previous tree to save on memory and execution time. It's natural to implement this recursively:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/binary_tree.py</span>
<span class="kw">class</span> BinaryTree(LogicalBase):
<span class="co"># ...</span>
    <span class="kw">def</span> _insert(<span class="ot">self</span>, node, key, value_ref):
        <span class="kw">if</span> node is <span class="ot">None</span>:
            new_node = BinaryNode(
                <span class="ot">self</span>.node_ref_class(), key, value_ref, <span class="ot">self</span>.node_ref_class(), <span class="dv">1</span>)
        <span class="kw">elif</span> key &lt; node.key:
            new_node = BinaryNode.from_node(
                node,
                left_ref=<span class="ot">self</span>._insert(
                    <span class="ot">self</span>._follow(node.left_ref), key, value_ref))
        <span class="kw">elif</span> node.key &lt; key:
            new_node = BinaryNode.from_node(
                node,
                right_ref=<span class="ot">self</span>._insert(
                    <span class="ot">self</span>._follow(node.right_ref), key, value_ref))
        <span class="kw">else</span>:
            new_node = BinaryNode.from_node(node, value_ref=value_ref)
        <span class="kw">return</span> <span class="ot">self</span>.node_ref_class(referent=new_node)</code></pre>

<p>Notice how we always return a new node (wrapped in a <code>NodeRef</code>). Instead of updating a node to point to a new subtree, we make a new node which shares the unchanged subtree. This is what makes this binary tree an immutable data structure.</p>

<p>You may have noticed something strange here: we haven't made any changes to anything on disk yet. All we've done is manipulate our view of the on-disk data by moving tree nodes around.</p>

<p>In order to actually write these changes to disk, we need an explicit call to <code>commit()</code>, which we saw as the second part of our <code>set</code> operation in <code>tool.py</code> at the beginning of this section.</p>

<p>Committing involves writing out all of the dirty state in memory, and then saving the disk address of the tree's new root node.</p>

<p>Starting from the API:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/interface.py</span>
<span class="kw">class</span> DBDB(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> commit(<span class="ot">self</span>):
        <span class="ot">self</span>._assert_not_closed()
        <span class="ot">self</span>._tree.commit()</code></pre>

<p>The implementation of <code>_tree.commit()</code> comes from <code>LogicalBase</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> LogicalBase(<span class="dt">object</span>)
<span class="co"># ...</span>
    <span class="kw">def</span> commit(<span class="ot">self</span>):
        <span class="ot">self</span>._tree_ref.store(<span class="ot">self</span>._storage)
        <span class="ot">self</span>._storage.commit_root_address(<span class="ot">self</span>._tree_ref.address)</code></pre>

<p>All <code>NodeRef</code>s know how to serialise themselves to disk by first asking their children to serialise via <code>prepare_to_store()</code>:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> ValueRef(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> store(<span class="ot">self</span>, storage):
        <span class="kw">if</span> <span class="ot">self</span>._referent is not <span class="ot">None</span> and not <span class="ot">self</span>._address:
            <span class="ot">self</span>.prepare_to_store(storage)
            <span class="ot">self</span>._address = storage.write(<span class="ot">self</span>.referent_to_string(<span class="ot">self</span>._referent))</code></pre>

<p><code>self._tree_ref</code> in <code>LogicalBase</code> is actually a <code>BinaryNodeRef</code> (a subclass of <code>ValueRef</code>) in this case, so the concrete implementation of <code>prepare_to_store()</code> is:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/binary_tree.py</span>
<span class="kw">class</span> BinaryNodeRef(ValueRef):
    <span class="kw">def</span> prepare_to_store(<span class="ot">self</span>, storage):
        <span class="kw">if</span> <span class="ot">self</span>._referent:
            <span class="ot">self</span>._referent.store_refs(storage)</code></pre>

<p>The <code>BinaryNode</code> in question, <code>_referent</code>, asks its refs to store themselves:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/binary_tree.py</span>
<span class="kw">class</span> BinaryNode(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> store_refs(<span class="ot">self</span>, storage):
        <span class="ot">self</span>.value_ref.store(storage)
        <span class="ot">self</span>.left_ref.store(storage)
        <span class="ot">self</span>.right_ref.store(storage)</code></pre>

<p>This recurses all the way down for any <code>NodeRef</code> which has unwritten changes (i.e., no <code>_address</code>).</p>

<p>Now we're back up the stack in <code>ValueRef</code>'s <code>store</code> method again. The last step of <code>store()</code> is to serialise this node and save its storage address:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/logical.py</span>
<span class="kw">class</span> ValueRef(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> store(<span class="ot">self</span>, storage):
        <span class="kw">if</span> <span class="ot">self</span>._referent is not <span class="ot">None</span> and not <span class="ot">self</span>._address:
            <span class="ot">self</span>.prepare_to_store(storage)
            <span class="ot">self</span>._address = storage.write(<span class="ot">self</span>.referent_to_string(<span class="ot">self</span>._referent))</code></pre>

<p>At this point the <code>NodeRef</code>'s <code>_referent</code> is guaranteed to have addresses available for all of its own refs, so we serialise it by creating a bytestring representing this node:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/binary_tree.py</span>
<span class="kw">class</span> BinaryNodeRef(ValueRef):
<span class="co"># ...</span>
    <span class="ot">@staticmethod</span>
    <span class="kw">def</span> referent_to_string(referent):
        <span class="kw">return</span> pickle.dumps({
            <span class="st">&#39;left&#39;</span>: referent.left_ref.address,
            <span class="st">&#39;key&#39;</span>: referent.key,
            <span class="co">&#39;value&#39;</span>: referent.value_ref.address,
            <span class="co">&#39;right&#39;</span>: referent.right_ref.address,
            <span class="co">&#39;length&#39;</span>: referent.length,
        })</code></pre>

<p>Updating the address in the <code>store()</code> method is technically a mutation of the <code>ValueRef</code>. Because it has no effect on the user-visible value, we can consider it to be immutable.</p>

<p>Once <code>store()</code> on the root <code>_tree_ref</code> is complete (in <code>LogicalBase.commit()</code>), we know that all of the data are written to disk. We can now commit the root address by calling:</p>

<pre class="sourceCode python"><code class="sourceCode python"><span class="co"># dbdb/physical.py</span>
<span class="kw">class</span> Storage(<span class="dt">object</span>):
<span class="co"># ...</span>
    <span class="kw">def</span> commit_root_address(<span class="ot">self</span>, root_address):
        <span class="ot">self</span>.lock()
        <span class="ot">self</span>._f.flush()
        <span class="ot">self</span>._seek_superblock()
        <span class="ot">self</span>._write_integer(root_address)
        <span class="ot">self</span>._f.flush()
        <span class="ot">self</span>.unlock()</code></pre>

<p>We ensure that the file handle is flushed (so that the OS knows we want all the data saved to stable storage like an SSD) and write out the address of the root node. We know this last write is atomic because we store the disk address on a sector boundary. It's the very first thing in the file, so this is true regardless of sector size, and single-sector disk writes are guaranteed to be atomic by the disk hardware.</p>

<p>Because the root node address has either the old or new value (never a bit of old and a bit of new), other processes can read from the database without getting a lock. An external process might see the old or the new tree, but never a mix of the two. In this way, commits are atomic.</p>

<p>Because we write the new data to disk and call the <code>fsync</code> syscall<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> before we write the root node address, uncommitted data are unreachable. Conversely, once the root node address has been updated, we know that all the data it references are also on disk. In this way, commits are also durable.</p>

<p>We're done!</p>

<h3 id="how-noderefs-save-memory">How NodeRefs Save Memory</h3>

<p>To avoid keeping the entire tree structure in memory at the same time, when a logical node is read in from disk the disk address of its left and right children (as well as its value) are loaded into memory. Accessing children and their values requires one extra function call to <code>NodeRef.get()</code> to dereference (&quot;really get&quot;) the data.</p>

<p>All we need to construct a <code>NodeRef</code> is an address:</p>

<pre><code>+---------+
| NodeRef |
| ------- |
| addr=3  |
| get()   |
+---------+</code></pre>

<p>Calling <code>get()</code> on it will return the concrete node, along with that node's references as <code>NodeRef</code>s:</p>

<pre><code>+---------+     +---------+     +---------+
| NodeRef |     | Node    |     | NodeRef |
| ------- |     | ------- | +-&gt; | ------- |
| addr=3  |     | key=A   | |   | addr=1  |
| get() ------&gt; | value=B | |   +---------+
+---------+     | left  ----+
                | right ----+   +---------+
                +---------+ |   | NodeRef |
                            +-&gt; | ------- |
                                | addr=2  |
                                +---------+</code></pre>

<p>When changes to the tree are not committed, they exist in memory with references from the root down to the changed leaves. The changes aren't saved to disk yet, so the changed nodes contain concrete keys and values and no disk addresses. The process doing the writing can see uncommitted changes and can make more changes before issuing a commit, because <code>NodeRef.get()</code> will return the uncommitted value if it has one; there is no difference between committed and uncommitted data when accessed through the API. All the updates will appear atomically to other readers because changes aren't visible until the new root node address is written to disk. Concurrent updates are blocked by a lockfile on disk. The lock is acquired on first update, and released after commit.</p>

<h3 id="exercises-for-the-reader">Exercises for the Reader</h3>

<p>DBDB allows many processes to read the same database at once without blocking; the tradeoff is that readers can sometimes retrieve stale data. What if we needed to be able to read some data consistently? A common use case is reading a value and then updating it based on that value. How would you write a method on <code>DBDB</code> to do this? What tradeoffs would you have to incur to provide this functionality?</p>

<p>The algorithm used to update the data store can be completely changed out by replacing the string <code>BinaryTree</code> in <code>interface.py</code>. Data stores tend to use more complex types of search trees such as B-trees, B+ trees, and others to improve the performance. While a balanced binary tree (and this one isn't) needs to do <span class="math">\(O(log_2(n))\)</span> random node reads to find a value, a B+ tree needs many fewer, for example <span class="math">\(O(log_{32}(n))\)</span> because each node splits 32 ways instead of just 2. This makes a huge different in practice, since looking through 4 billion entries would go from <span class="math">\(log_2(2^{32}) = 32\)</span> to <span class="math">\(log_{32}(2^{32}) \approx 6.4\)</span> lookups. Each lookup is a random access, which is incredibly expensive for hard disks with spinning platters. SSDs help with the latency, but the savings in I/O still stand.</p>

<p>By default, values are stored by <code>ValueRef</code> which expects bytes as values (to be passed directly to <code>Storage</code>). The binary tree nodes themselves are just a sublcass of <code>ValueRef</code>. Storing richer data via <a href="http://json.org">json</a> or <a href="http://msgpack.org">msgpack</a> is a matter of writing your own and setting it as the <code>value_ref_class</code>. <code>BinaryNodeRef</code> is an example of using <a href="https://docs.python.org/3.4/library/pickle.html">pickle</a> to serialise data.</p>

<p>Database compaction is another interesting exercise. Compacting can be done via an infix-of-median traversal of the tree writing things out as you go. It's probably best if the tree nodes all go together, since they're what's traversed to find any piece of data. Packing as many intermediate nodes as possible into a disk sector should improve read performance, at least right after compaction. There are some subtleties here (for example, memory usage) if you try to implement this yourself. And remember: always benchmark performance enhancements before and after! You'll often be surprised by the results.</p>

<h3 id="patterns-and-principles">Patterns and Principles</h3>

<p>Test interfaces, not implementation. As part of developing DBDB, I wrote a number of tests that described how I wanted to be able to use it. The first tests ran against an in-memory version of the database, then I extended DBDB to persist to disk, and even later added the concept of NodeRefs. Most of the tests didn't have to change, which gave me confidence that things were still working.</p>

<p>Respect the Single Responsibility Principle. Classes should have at most one reason to change. That's not strictly the case with DBDB, but there are multiple avenues of extension with only localised changes required. Refactoring as I added features was a pleasure!</p>

<h3 id="summary">Summary</h3>

<p>DBDB is a simple database that makes simple guarantees, and yet things still became complicated in a hurry. The most important thing I did to manage this complexity was to implement an ostensibly mutable object with an immutable data structure. I encourage you to consider this technique the next time you find yourself in the middle of a tricky problem that seems to have more edge cases than you can keep track of.</p>

<div class="footnotes">
<ol>
<li id="fn1"><p>Bonus feature: Can you guarantee that the compacted tree structure is balanced? This helps maintain performance over time.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Calling <code>fsync</code> on a file descriptor asks the operating system and hard drive (or SSD) to write all buffered data immediately. Operating systems and drives don't usually write everything immediately in order to improve performance.<a href="#fnref2">↩</a></p></li>
</ol>
</div>
  </body>
</html>
